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Abstract 

Li-penalized regression methods such as the Lasso ( Tibshirani 19961 ) that 
achieve both variable selection and shr inkage have been very popu lar. An exten- 
sion of this method is the Fused Lasso (ITibshirani and Wand 120071 ). which allows 
for the incorporation of external information into the model. In this article, we 
develop new and fast algorithms for solving the Fused Lasso which are based on 
coordinate-wise optimization. This class of algorithms has recen tly been applied 
very successfully to solve Li-penalized problems very quickly ([Friedman et al 



20071 ). As a straightforward coordinate- wise procedure does not converge to the 



global optimum in general, we adapt it in two ways, using maximum-flow al- 
gorithms and a Huber penalty based approximation to the loss function. In a 
simulation study, we evaluate the speed of these algorithms and compare them 
to other standard methods. As the Huber-penalty based method is only approx- 
imate, we also evaluate its accuracy. Apart from this, we also extend the Fused 
Lasso to logistic as well as proportional hazards models and allow for a more 
flexible penalty structure. 



1 Introduction 

Li penalization methods have been used very successfully for variable selection as wel l 
as prediction. One of the most widely known methods is the Lasso (jTibshiranil Il996l ) 
which applies an L\ penalty to the coefficients in the model and minimizes the loss 
function 

^(y - X/3) T (y - X/3) + Ax X> fc |/3 fc | 

k=l 



where X G R nxp as well as y,/3 G W. Here, Wk > is some weight associated 
with coefficient (5k- By using this penalty, it is possible to fit models even if p > n, 
perform variable selection and improve prediction performance through shrinkage. In 
recent years, this model has been extended in ma ny ways, for example by allowing 
for grouping of variabl es fe.g. lYuan and Lin 120061 ). by including an additional ridge 



regression penalty (see IZou and Hastid 120051 ) or by choosing appropriate weights for 



the L\ penalty to improve its asymptotic properties (jZoull2006l ). to name just a few. 

Often, there is also some other prior information that could be incorporated into 
the model. An extension of the Lasso that takes advantage of such information is the 
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Fused Lasso ( ITibshirani and Wang 120071 ) . In its simplest form, it assumes that the 
coefficients (3 k are ordered and imposes an additional L\ penalty on the differences of 
neighbors. The loss function that is being minimized in this case is then 



p-i 



-(y - X/3) T (y - X/3) + Ai ^ w k \fa\ + A 2 ^ w Kk+1 \f3 k+1 - f3 k \ 



k=l 



k=l 



where A 2 is the parameter that controls the smoothness of the resulting solution. Most 
often, this model has been used in the form of the Fused Lasso Signal Approximator 
(FLSA) where we have that X = I, the identity matrix. This m odel can, for example, 
be us ed for Comparative Genomic Hybridization (CGH) data (see ITibshirani and Wang 
20071 ). For CGH data, at various points of a chromosome, a noisy estimate of the number 
of available copies of the genomic data is being taken. As duplications or deletions 
usually occur over large areas, the A 2 penalty smoothes the copy number estimate over 
neighboring data points, thereby increasing the accuracy of th e estimation . Another 
possible application of the FLSA is the denoising of images 
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Friedman et al. 2007) 



Recently, some theoretical properties of the model h a ve be en shown by iRinaldol ( 120091 ) 
and a path algorithm has been developed in iHoffingl ( 120101 ). 

In the above model, it is not necessary to restrict the A 2 -penalty to neighboring 
coefficients. To allow for greater flexibility, we penalize all differences of coefficients 
that correspond to an edge in a graph. To be more precise, let V = {1, ... ,p} be a set 
of nodes, each of which corresponds to one of the coefficients. Then we define a graph 
Q = {V,E) with edge set E and penalize all differences \(3 k — 0i\ for which (k,l) G E, 
i.e. for which the graph Q has a corresponding edge. Additionally, we also allow for 
different weights for each of the edges (and each of the \^ k \ penalties) so that we get 
the loss function 

1 P 
g(X,y,f3,w,g,X) = -( y _X/3) T (y-X/3)+A 1 ^ Wfc |/3 fc |+A 2 ]T w k ,i\p k -^\. (1) 

k=l (k,l)eE,k<l 

In addition to the squared error loss function above, we can of course also include other 
convex functions. Specifically, we will also look at logistic regression as well as the Cox 
proportional hazards models. 

To the best of our knowledge, there is currently no optimized algorithm available 
that minimizes the loss function in Equation (OQ). However, solving this model efficiently 
is important so that it can be applied in larger datasets or in situations where a large 
number of fits of a model is important (e.g. when bootstrapping a method that uses 
cross-validation). A method that has been very succes sfully used for the Lasso model 
is coordinate- wise optimization ( jFriedman et al.l 120071 ; see). In this article, we will 
present two algorithms that extend this technique to the Fused Lasso model, one exact, 
the other approximate. In Section [2J we will outline the algorithms and prove that it 
converges to a global optimum for the exact method. Its speed will be evaluated and 
compared to other methods in Section [3] The results will then be discussed in Secti on 
HI An implementation of this method is available from CRAN as FusedLasso [Honing . 
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2 The Algorithm 



In this section we want to present our algorithms for solving the Fused Lasso problem 
specified in Equation ([1]). We will do this in several steps. First, we will present the 
naive implementation of the coordinate-wise method that just optimizes each coordinate 
of the function g iteratively until it converges. However, this solution is not guaranteed 
to be the global optimum or even be unique and in general performs rather poorly in 
our setting (see Section [27Tj) . In order to avoid these problems, we force certain variables 
to be equal to each other (i.e. fused) and thus allow the coordinate-wise algorithm to 
overcome its limitations (see Sections 12.21 and 12. 3ft . To be more precise, for a set of 
fused variables F C we require that j3k = Pi for all k, I G F. There can 

also be several, non-overlapping sets F. These restrictions are then included in the loss 
function g to form a constrained loss function g, which is then being optimized using 
the naive coordinate-wise procedure. By choosing the sets correctly, our algorithm will 
converge to the global optimum, for which we provide the proof in the appendix. 

However, for models for which the number of variables is very large, the method 
used here can be relatively slow. Therefore, we also introduce an approximate method 
that addresses this problem and will be presented in Section 12.51 



2.1 Naive Algorithm 

A naive coordinate- wise optimization procedure for los s func tion (CQ) works analogous to 



m p_ 

the method for the Lasso described in lFriedman et al.l (120071 ) . For every k G {1, . . . ,p} 



/3k is being optimized while keeping all other coefficients constant. This is very simple 
as dg(K,y, (3,w,Q, X)/dflj. is piecewise linear, increasing step function with positive 
jumps. For these, the unique root can be found very easily. This simple procedure can 
be sped up considerably for cases where p > n as then most coefficients will always be 
and in general will not be changed by the coordinate optimization steps. In order to 
exploit this, an active set A is defined and the coordinate-wise procedure optimizes only 
over the variables in A, leaving out many variables that are and do not need to be 
considered in every iteration. In order to guarantee convergence, it is then necessary to 
regularly adapt the set A so that variables that could become non-zero in the next step 
are included and variables that have become are excluded. As this technique is not a 
central point of this article, we will not elaborate on it any further. The conditions for 
changing the set A can be found in Algorithm [1] and are easily derived in closed form. 
Furthermore, the und erlying concept of active variables is explained in more detail in 



Friedman et al.l (120071 ). 

Our first concern here is of course that the coordinate-wise algorithm converges at 
all. The following theorem shows that while a unique point does not necessarily exist, 
any cluster point of the algorithm is a coordinate- wise minimum of g. 

Theorem 1. Assume that the coordinate-wise algorithm starts at (3°. Let (3 r be the 
sequence generated by the coordinate steps for r G IN. Then (3 r is bounded and every 
cluster point is a coordinate-wise minimum of g. 

The proof can be found in the appendix. For the rest of the article, we will view a 
whole run of the coordinate-wise algorithm as a single step in our procedure. Accord- 
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Algorithm 1: Naive coordinate- wise optimization algorithm for the Fused Lasso. 
Data: y, X, (3 

Result: (3 optimizing g(X, y, (3, w, Q, A) 
{I- (%■ 

A:={k\fa^0}; 
repeat 
repeat 

foreach k G A do 

[ minimize g(X, y, (3, w, Q, A) over fa; 
end 

until converged; 

Set A := A U = and argming g(X, y, (3, w, A) ^ 0}; 

until ^4 not change; 



ingly, /3^ m ' will refer to the estimate of (3 after the m-th application of the coordinate- 
wise minimization. 

The problem with this naive implementation of the coordinate-wise algorithm is 
that it is not guaranteed to converge to a global optimum. Very often, it will be stuck 
in local coordinate-wise minima although an improvement of the loss function would be 
possible if several fa could be moved at the same time instead of separately. In order to 
incorporate this, we will fuse coefficients. When we refer to 2 neighboring coefficients 
as fused, we mean that we force them to be equal in the coordinate optimization steps 
and thus move together. More specifically, we define a set of coefficients to be a fused 
set as follows: 

Definition 1. Let Q\F be the graph Q = (V, E) restricted to the set F, i.e. Q\F = 
(F, {(k,l)\(k,l) G E fork, I G F}). Then a set F C {l,...,p} is called a fused set 
w.r.t. (3 if Q\F , is connected and fa = $ V k,l G F . 

Now, we will incorporate these fusions into our loss function. 
2.2 Incorporating the fused sets 

Let $ be a collection of fused sets, i.e. $ = {F\, . . . ,F\$\} where each Fi is a fused 
set according to the Definition [1] above. Furthermore, we assume that the Fi are non- 
overlapping and cover all coefficients, that is 

FiHFjVi^j and U 1 ^ F = {1, . . . ,p} 

Then, given these sets we want to solve the constrained optimization problem 

minimize g(X, y, (3, w, Q, A) 
subject to fa = f3i Vk, I G F t VF^ G 

However, as can be seen very easily, this is actually equivalent to 

minimize g(X, y, (3, w, Q, A) 

where we define X, y, (3, w, Q as follows: 
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X: Here, column i of X corresponds to the sum of the columns of X of the indices in 
Fj, i.e. 

y: Similar to above, we define 

Vi'=^2vk VFj G £. 

(3: Pi = flk for any k G Fj for all Fi G 

w: For the weights for Ai we set for each Fi G 5 

u>j := 2j ^fc 

and for the weights associated with A2 and Q define for each combination Fi, Fj G J 
for j ^ j that 

(/c,OI(fc,OG- B ; fc G- F i^eF j 
Here we define the graph as = (V, F) where 

V = {l,...,m} and 
£ = {(i,j)\3k G / G F,- where Fj, Fj and (jfe, /) G F}. 

It is easy to derive these adjusted input variables by replacing (3k by Pi for all k G Fj 
for every Fj G ^ in Equation ([T|) and then simplify the result. Therefore, we will omit 
a detailed derivation here. 

Using these fused input data, we can now apply the coordinate-wise algorithm to 
g until it converges. However, choosing the correct fusions is crucial for the algorithm 
to converge to the global optimum. In the definition of fused sets, we stated that is is 
necessary that they are connected w.r.t. Q and all have the same value. However, 
this does not imply that all coefficients with the same value that are connected have 
to be fused. This gives us some freedom in choosing the connected sets. Next, we 
will specify how to choose them in order to ensure that the algorithm converges to the 
global minimum. 

2.3 How to define the fused sets using a maximum- flow prob- 
lem 

In the coordinate-wise algorithm, the issue that may prevent the algorithm from con- 
verging to the correct solution is that, while it may not be possible to improve the loss 
function by just moving one coefficient, it may be possible to get an improvement by 
moving several at the same time. We assume that we have some current estimate f3^ m ' 
in step m, and we now want to identify sets Fj that can be moved in the next iteration 
of the coordinate-wise algorithm. 
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As a first step, we define the sets as all coefficients that are connected and have the 
same value. This is easily done by using the following equivalence relation ~: 

k ~ I = and k is connected to / w.r.t. Q 

only through nodes with value equal to (3^. 

Here, we suppress for notational convenience the dependence of the relation on Q as 
well as (3^ rn \ This equivalence relation is clearly reflexive, symmetric and transitive and 
therefore partitions the set {1, . . . ,p} into sets P 1 , . . . , P|^|. We call the collection of 
these sets *}3, i.e. 

^^{P 1 ,P 2 ,...,P H }. 

For each of these sets P, all pairs of coefficients k,l G Pi are then by definition of the 
equivalence relation connected in Q and have the same value of the coefficients (i.e. 

= A^)- Furthermore, they are maximal in the sense that they could not be 
enlarged and still have these properties. 

This partition is the starting point for our fused sets. We now want to specify in 
which situation these sets can stay as they are and when they have to be split into 
several parts. The latter situation occurs when such a split makes an improvement of 
the loss function possible. For this, we need to look at the effect of the A 2 -penalty inside 
the groups Pi separately. 

In order to do this, we slightly rewrite the loss function. In the following, please 
note that any conditions that are separated by commas are "and" conditions. 

1 P 
g(X, y,/3, w, Q, A) = -(y - X(3) T (y - X/3) + X, £ w k \(3 k \ + A 2 £ w k>l \(3 k - A| = 

k=l (k,l)eE,k<l 

i m 

-(y-X/3) T (y-X/3) + A 2 ]T £ w kJt \fi k - AI + 

i=i (k,i)eE,kePi,igPi,k<i 

m , m 

^i^2^2w k \(3 k \ + \ 2 ^2 ^2 w Ki\pk - ai = 

i=i k eP, i=i (k,i)eE,k,ieP t ,k<i 

h(x, y , /3, w, g, a, qj) + Ax Yl W *\P*\ + X *Y1 E w w\P* 

i=i fceP l i=i (k,i)eE,k,iePi,k<i 

where we collect everything in the function /i(X, y, /3, w, Q, A, except for the effect 
of the Ai-penalty and the A2-penalty inside the groups Pi. Then by the definition of 
the Pi, we see that h is differentiable. 

We now want to find out, if we should split the set Pi, where we assume for simplicity 
first that Ac 7^ f° r k £ Pi- As a mental image, we imagine the elements in Pj as balls. 
They are connected as defined by the edges in Q with strings, which have tensile strength 
^2W k i. On each ball, we now apply a force of dh/df3 k + Aiu>fcsign(/3fc) for k G P«, where 
the sign determines if the force is applied upwards or downwards. The question we 
want to answer is if some of the strings will break and if they break, which of the balls 
will still be connected to each other. If the balls break apart into separate groups, then 
we should also split Pj into the corresponding groups. 
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We can specify this mental image in the form of a maximum flow problem. For 
this restrict graph Q to the set p and add a source node s (pulling upwards) and a 
sink node r (pulling downwards). Our new graph is then Qf 1 = (V^ , E^), where 
y^ 1 = p i u { r , s}. For the edges, we define 

Ef ={(k,l)\k,le Pi and (k, I) G E}U 

{(k, s), (s, k)\dh/dp k + Ax^sign^) < 0}U 
{(k, r), (r, k)\dh/d/3 k + A lWfc sign(/3 fc ) > 0} 

such that we keep all the edges in Q that lie in p and add an edge to the source if 
we pull upwards on the coefficient and an edge to the sink if we pull downwards. For 
the maximum flow problem to be complete, we now only have to specify the capacities 
on each of the nodes. These correspond to the tensile strength of the cables in our 
little mental image for edges between nodes in Pj or the strength at which we pull on a 
node for edges to either the source or the sink. More specifically, let c k i be the capacity 
between nodes k and /. Then 



cm = ci k = X 2 w k i for (k, I) G Ef 1 and k, I G p, 
' dh 

M 



dh 

°sk = -[ + \iw k sign((3 k ) ) and c ks = for (k, s) G E™ , 



and 



dh 

°kr = [ ww- + Aiw fc sign(/3 fc ) ) and c rk = for (k, r) G Ef 



For this graph we find the maximum possible flows and then look at the residual 
graph Qf 1 . In our mental image, the maximum flow corresponds to the maximum force 
we can exert before some strings start breaking. If some edges from the source still have 
residual capacity, this means that we could pull harder upwards, therefore breaking off 
some nodes - exactly the ones that are still connected to the source in the residual 
graph. The same is true for available residual capacity to the sink. Therefore we define 

F i+ = {k G Pi\k is connected to s in Qf} 
Fi_ = {k G Pi\k is connected to r in Q^} 
Po = P\(P + U P_) 

These sets are not necessarily connected, in which case we separate them into their 
connected components. These are then the new fused sets, based on which we define 
the fused matrices and vectors and run the component-wise algorithm. 

For the sets P for which /3 k — for k G Pj, things are just a little more complicated, 
as we also have to overcome the Ai-penalty. This means that we have to define two flow 
graphs, one for trying to get (3k to be positive, one for making fl k negative. The first 
graph is Q^_, which is defined analogous to Qf 1 above, except that instead of dh/d/3 k + 
XiWksign(f3k) we use dh/df3k + \iw k , i.e. assume that sign(/3fc) = 1. Correspondingly, 
we also define Qf[ using dh/dj3 k — ^iw k , i.e. under the assumption that sign(/3fc) = —1. 
With and the correspond residual graph we can now determine if we can break 
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of a set that will become positive and using Qfz. if a set can become negative. These 
are then defined as 

Fi + = {k E Pi\k is connected to s in Gf + } 
= {k G Pi | k is connected to r in Q^_\ 
F i0 = P l \(F l+ UF^). 

This now specifies how to generate the fused sets $ using (3 and In the next section, 
we will now present the complete algorithm and a proof that it converges to the global 
optimum can be found in the appendix. 

2.4 The complete algorithm 

We now have specified everything that we need in order to run the algorithm. We 
know how to define the sets and $ as well as given the grouping of the coefficients 
how to calculate X, y, w and Q in order to apply the component-wise algorithm to 
g = g(X, y, f3, w, Q, A). Then, given the solution of the component-wise run, we specify 
a new grouping. We repeat these steps until convergence. 

As the application of the maximum flow algorithm is relatively expensive, we add 
one more rule in order to apply it as little as possible. 

Fuse sets First, we do not check any sets if they have to broken up and just set 
Fi = Pi, i = 1, . . . , |?p|. Run the component-wise algorithm. 

Split active sets If we previously used the Fuse sets rule and f3 did not change, 
then check sets Pi with /3k ^ for k G Pi if they should be split up. Run the 
component-wise algorithm. 

Split inactive sets If we previously used the Split active sets rule and (3 did not 
change, then check all sets with (3^ = for A; G Pj. Run the component- wise 
algorithm. 

Using this we can avoid to calculate maximum flows until we have exhausted the much 
easier and computationally cheaper fusion of sets. The complete procedure is given in 
more detail in Algorithm [2j 

Now we can prove that this algorithm is guaranteed to converge to the global opti- 
mum. 

Theorem 2. Assume we want to minimize expression^ Then Algorithm\^is guaran- 
teed to converge to the global optimum. 

The proof of this theorem can be found in the appendix. 

2.5 Alternative algorithm using Huber penalty 

The algorithm presented above relies for convergence on solving maximum-flow prob- 
lems. When these are very large, they can become the bottleneck of the computation 
and slow it down. Therefore, we propose a second algorithm that works without any 
maximum-flow computations at the price of only yielding approximate solutions. 



8 



Algorithm 2: Coordinate-wise optimization algorithm for the Fused Lasso with 
grouping step. 



Data: y, X, (3 

Result: (3 optimizing g(X, y, (3, w, Q, A) 

Fi :={i}i = l,...,p; 
repeat 

Based on Ff, % — 1, . . . , \F\ calculate f3, X, y, w and Q; 
Apply naive coordinate- wise to g = g(X, y, (3, w, Q, A) 
Determine the new sets Fi as described above according to Fuse sets, Split 
active sets and Split inactive sets rule; 
until Ff, i = 1, . . . , \F\ did not change; 



The main part of the algorithm stays the same. We only replace the Split active 
sets and Split inactive sets rule presented in Section [23] by an application of the same 
problem using a Huber penalty with some M > 0, defined as 



for - 1/M < x < 1/M 
I \x\ — jj-j- otherwise 



M 2 
2 ^ 



instead of an L\ penalty on the differences of coefficients. It is easy to verify that Pm{x) 
is continuous and differentiate everywhere and that Pm{x) — >■ \X\ as M — > oo. The 
loss function that we aim to optimize then is 

g M (X, y, (3, w, Q, A) =\{y - X/3) T (y - X/3) + 

v 

Ai ^W/t|/3fc| + A 2 ^ w kt ip M {Pk - A)- (2) 

fe=l (k,l)£E,k<l 

As for (jfj\/ the penalty on differences — is now differentiable everywhere, it can 
be shown tha t a naive co ordinate-wise optimization algorithm converges to a global 



optimum (see lTsengll200l[ ) and that for M — > oo, this global optimum converges to the 
global optimum of the Fused Lasso problem in Equation ([T|). However, we will not use 
the approximation as given above directly to solve our problem. The reason is that the 
convergence using the approximation in Equation ([2]) can be very slow when using a 
coordinate-wise algorithm. However, as it is differentiable everywhere, the coordinate- 
wise algorithm cannot get stuck. Therefore, we will use this approximation if the naive 
algorithm presented in Section [1] gets stuck in a coordinate-wise optimum that is not 
the global optimum. 

To be more precise, for our algorithm we first use the naive coordinate-wise version 
presented in Algorithm [TJ After each convergence, we fuse all coefficients that are 
equal to each other and connected (i.e. the fused sets are just as defined in Section 
2.3p and reapply the naive coordinate-wise procedure until nothing changes anymore. 
At this point, we may be stuck at a coordinate- wise minimum that is not the global 
optimum. In order to get "unstuck", we apply the naive coordinate-wise procedure to 
the loss function with Huber penalty gM for some value of M for a number of iterations 
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K. The precise value of M is not very important and we set it to 1000. As we already 
mentioned above, convergence of gu can take very long so that we stop it after K 
iterations (we set K=100), as this is usually sufficient to become "unstuck". We iterate 
this procedure until (3 has not changed more than e > according to some metric (e.g. 
Li) since the last application of the huberized loss function gM- 

The algorithm as presented above is not guaranteed to converge, however as we will 
see in the simulations section, it usually comes very close to the global optimum. The 
complete description can be seen in Algorithm [3l 



Algorithm 3: Coordinate-wise optimization algorithm for the Fused Lasso em- 
ploying Huber penalty. 
Data: y, X, (3 

Result: j3 optimizing g(X, y, /3, w, Q, A) 
/3:=/3 ; 

save • Pqi 

Fi :={i}i = l,...,p; 
repeat 

Based on Ff, % — 1, . . . , \F\ calculate f3, X, y, w and Q\ 
Apply naive coordinate- wise alg. to g = g(X, y, /3, w, Q, A); 
Set t v ^: 

if $ did not change then 

Apply naive coordinate-wise alg. for at most K steps to 
g M (X,y,(3, w,G, A); 

(3 save 

end 

until ||/3 - (3 save ||i < e; 



All of these algorithms are not restricted to using squared error loss and there- 
fore we will quickly discuss some extensions to logistic regression aw well as the Cox 
proportional hazards model. 

2.6 Extensions to logistic regression and the Cox proportional 
hazards model 

Of course, the Fused Lasso model is not restricted to squared error loss. Just as with 
the Lasso or the elastic net, we can apply the same mechanism also to generalized linear 
models. By using iteratively reweighted least squares (IRWLS), we can then leverage 
our algorithm for solving the Fused Lasso with squared error loss to generalized linear 
mod els. Here, we will br iefly describe the implementation for logistic regression (see 



also IFriedman et al.ll2010l ) and the Cox proportional hazards model. 



2.6.1 Logistic regression 

In the setting of a binary response variable Y (with values and 1), logistic regression 
can be used. Here, the conditional probability of success conditioned on predictor vector 



10 



x is usually expressed as 

1 



P(Y = l|x) 



1 + exp(-x T /3) 

together with the Fused Lasso penalties, the function that we want to minimize is 

n p 

-^log(P(F = 2/ fc |x fc )) + Ai^«; fe |/3 fc | + A 2 ^ w k i\/3 k - 

k=l k=l (k,l)eE;k<l 

In order to do this, we iteratively apply a quadratic approximation to the log-likelihood 
— J^™ =1 log(P(y = |/j|xj)) at the current estimates in step m, (3^ m \ The exact details 
of this derivation can be looked up in many textbooks on generalized linear models 



see e.g. iNelder and McCullaghlll989l ) and therefore we omit them here. The resulting 



quadratic approximation has then the form 

^ n p 

-22vk(zk - Xfc/3) 2 + Ai y^Wfc|/3 fc | + A 2 2J w k i\f3 k -(5i\ 

k=l k=l (k,l)sE;k<l 

where the working response z and the weights v are given by 



p(x fc ,/3 (m) )(l-p(x fc ,/3 M )) 
v k =p(x k ,[3^)(l-p(x kl [3^)). 

2.6.2 Cox proportional hazards model 

Another important regression model that is often used in many applications is the Cox 
proportional hazards model. For simplicity, we assume that there are no ties between 
the event times ti. Then the partial likelihood function is given by 



ti VEi€H(t,) eX P( X i' 3 ) 



where t k are the event times, 5 k = if the observation is censored and 6 k = 1 otherwise, 
and R(t k ) = {I : ti > t k } are the sets of individuals at risk at time t k . 

As in the case of the Logistic regression model, we also want to approximate the 
Cox model by a quadratic function around the current estimate of /3^ m \ If we let 

dlogL((3^) d 2 \ogL(f3^) 

d= — m — Q = w? 

then the quadratic approximation to — log L is 

-logL(/3 (m) ) -d T (/3-/3 (m) ) - (/3-/3 (m) ) T Q(/3-/3 (m) ). 
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The exact form of d and Q is easy to evaluat e and can be found in many standard 
textbooks (see e.g. lHastie and Tibshiranilll990l ). Specifically they are 



Ta(m)\ 



t EleAt*) ex P( x / Z 3 



Ta(mp 



Q = X T WX with 



W k k 



W k k' 



exp(x fc 



l-.keRi 



E,^ex P (xJ/3^; 



+ 



exp(2x£/3) 



(E i6fli exp(^M) J 



exp(x^ (m) )exp(x£/3 (m) ) £ 



"M'e* (E^ exp(xJ/3^)) 



2 ' 



The main problem is that W is not a diag onal matrix, which gre a tly in creases the com- 
putational complexity of the calculation. lHastie and Tibshiranil (119901 ) suggest to just 
use the diagonal of matrix W for the computations, which we will do. However, as we 
are not optimizing the step size of the algorithm, this approach may lead to convergence 
problems, especially in the case of highly correlated predictors. As an ad-hoc solution 
to this problem, we use an additional diagonal matrix D such that our approximate 
Hessian Q is then 

Q = DX T DiagWXD 

where D is chosen such that Diag(Q) = Diag(Q). While we do not have a rigorous 
theoretical justification, we found that in practice this gave good convergence of the 
algorithm, while only requiring the computation of the diagonal of Q, which is compu- 
tationally still feasible. 



3 Simulations 

In this section, we want to demonstrate the speed of our new algorithm. For this, we 
will simulate data and compare the speed of the naive, the maximum-flow based and 
the Huber penalty based Fused Lasso algorithms to other methods discussed below. 
Additionally, as the naive and Huber penalty based methods are only approximate, 
we will also evaluate their performance in terms of accuracy with respect to the exact 
Fused Lasso solution. 



3.1 Methods used 



To the best of our knowledge, there is currently no specialized software available that 
solves the Fused Lasso for general matrices X and general graphs Q. Therefore we 
will use the CVX, a package for specifying and solving convex programs in Matlab 
(IGrant and Boydll2010t 120081 ). It is very versatile and is in our opinion the type of 
off-the shelf software that would be used to solve the Fused Lasso in absence of other 
specialized software. 

As the current algorithm is an extension of a coordinate-wise algorithm, we will 
also compare the speed of our method to an implementation of this approach, namely 
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the package glmnet from iFriedman et al.l (120071 ). The glmnet package implements a 
coordinate- wise descent algorithm to solve the elastic net, a combination of the lasso 
and ridge regression (see lZou and Hastidl2005l ). It minimizes the loss functions 



p p-i 
-(y - X/3f(y - X/3) + Xa^M + A(l - a) ^ w k (3 2 k . 

k=l k=l 

Due to the simpler structure of the loss function that is being optimized and the there- 
fore simpler algorithm, we expect the glmnet package to be faster than our implemen- 
tation. 

3.2 Continuous response data example 

For the simulation we have to generate a predictor matrix X and a vector of true 
coefficients f3. 

Here we distinguish if we want to generate data for simulations for a one- or a 
two-dimensional graph. 

One-dimensional data For the predictor matrix, all observations are being gen- 
erated independently of each other. Let p be the number of columns in the predictor 
matrix. In order to give the matrix some structure, we set the matrix to a value different 
from for some intervals before adding Gaussian noise to it. For observation i let rij ~ 
Poisson( v /p/2) be the number of such intervals, which have length ~ Poisson( v /p), 
starting position s^- ~ U(2 — ly,p) and value ~ £/({— 3, —2, —1, 0, 1, 2, 3}), all inde- 
pendent for i — 1, . . . , n; j — 1, . . . , rij. For this we then set 

X ik = Vij + 7^ for Sij < k < Sij + lij — 1; i — 1, . . . , n; j — 1, . . . , n, 

where 7^ ~ N(0, 1) i.i.d. and in case of multiple assignment the one with the highest 
index j takes precedence. 

For the vector of coefficients, we choose such that f3 is except for a sequence of 
length 100 in the middle. 

Our response is then as usual 

Vi = (X(3)i + Si 

where £j ~ N(0, a) for a = 10. 

Two-dimensional data We generate the two-dimensional data in a similar way. For 
ease of notation, we assume that the predictor matrix is a 3-dimensional array (and 
cast it back into a 2-dimensional matrix when we specify y). Here let X G R nxp and 
for observation i we have rii ~ P°^ s {\^p) again the number of boxes. The boxes have 
a random length on both axis which is drawn from \ 1^ ~ Pois( v /p) and a uniform 

starting position on the axis using ~ U{2 — llj\p) and sj^ ~ U(2 — lj?\p). The 
value of the box we again draw from ~ U({—3, —2, —1,0, 1,2,3}) where all these 
random variables are all independent for i = 1, . . . , n; j = 1, . . . , rij. Using all this we 
set 

X ik(1)k(2 ) = Uii+7ifcd)*(a) for 4 1>2) ^ fc(1 ' 2) ^ S S' 2) +4 1>2) _1 ' i = 1, ■ ■ ■ ,n;j = 1, . . . ,rii 
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where again 7^(1)^(2) ~ N(0, 1) i.i.d. and highest index j takes precedence. 

We also interpret the vector of coefficients (3 as two-dimensional and pick a square 
in the middle of length 10, i.e. 




otherwise 



1 if p/2 - 4 < k 1 ' 2 < p/2 + 5 



and the response is as before 



Vi = (X/3)i + £i 



where ~ N(0, a) for a = 10. 



3.3 Speed 



We now want to compare the speed of the algorithm and its approximations with CVX 
and glmnet. In this comparison, it is important to note that glmnet solves a simpler 
problem than the Fused Lasso, but we include it as a faster benchmark. 

Here, we run the algorithm for 20 values of A2 that span the relevant range from 
[A^ aa: /10000, \™ ax ] using an exponential grid (i.e. there are about 5 values for A2 for 
each order of magnitude). Here \™ ax is the value for A2 for which when Ai = all 
coefficients have the same value. For each of the A2-values, we compute the solution 
for 50 values of Ai ranging in an exponential grid from [A^^/IOOOO, A™ ax ] (i.e. ~ 12 
values for every factor 10), where \™ ax is defined equivalently to \™ ax , guaranteeing 
that we cover the entire relevant range of values. In all these algorithms, we stop the 
computation when more than 2 • n elements of f3 are non-zero as these results would 
be considered especially unreliable and computation can become very slow with very 
many active variables. 

The results can be seen in Tables [1] and EJ Any version of the Fused Lasso algorithm 
is orders of magnitude faster than the results of the CVX package, thereby making it 
possible to run the Fused Lasso for larger problems. Also, as expected, the glmnet 
package is faster than any of the Fused Lasso methods. This is easily explained as 
the underlying mathematical problem that the glmnet procedure solves is considerably 
simpler than the Fused Lasso model. As discussed before, we included it as a faster 
benchmark. For the different Fused Lasso procedures we can see that naive implemen- 
tation is up to an order of magnitude faster than the exact procedure that is based on 
the maximum-flow algorithms. However, as we will see later the accuracy of the naive 
method can be quite poor. The Huber penalty based approach is usually as fast as the 
maximum-likelihood based method, but has speed advantages in cases where p » n. 
This can also be seen in Table EJ 

3.4 Accuracy 

In Section [2] we already mentioned that the naive algorithm as well as the Huber penalty 
based version are not guaranteed to converge to the global optimum. Therefore, we 
want to assess how far away from the global optimum, that we get with the exact 
maximum-flow based algorithm, the solutions are. As the solutions and therefore also 
the error are dependent on the choice of the penalty parameter, we report the worst-case 

scenario over the whole grid of Ai and A2 values. If we let f3 (Ai, A2), (3 (Ai, A2) 
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Param. 


Other 


FusedLasso 


n 


P 


Glmnet 


cvx 


exact 


naive 


TT 1 

Huber 


100 


1000 


1.14 


5883 


19.39 


3.32 


16.90 


100 


3000 


3.71 


21363 


87.73 


10.11 


50.67 


100 


5000 


6.90 


40691 


196.56 


15.67 


90.82 


200 


1000 


2.09 


16582 


37.99 


7.47 


39.08 


200 


3000 


6.72 


65328 


132.66 


20.16 


94.57 


200 


5000 


12.15 


122197 


277.13 


32.17 


176.19 


oUU 


1 nnn 
1UUU 


a no 
4.UZ 


1 nvson 


i nt; /in 
1U0.4U 


OA 1A 
z4. 1 4 


ns 
144. yo 


600 


3000 


18.37 




345.16 


58.01 


331.57 


600 


5000 


27.27 




576.41 


82.46 


482.28 


1000 


1000 


73.80 


303187 


204.34 


49.60 


321.34 


1000 


3000 


27.59 




574.31 


98.23 


618.09 


1000 


5000 


51.17 




1042.83 


163.51 


1007.66 



Table 1: Speed of the algorithms for continuous data with one-dimensional graph in 
seconds. 



Param. 


Other 


FusedLasso 


n 


P 


Glmnet 


CVX 


exact 


naive 


Huber 


100 


30x30 


1.55 


19867 


63.05 


13.10 


34.59 


100 


50x50 


5.93 


29058 


264.62 


26.36 


97.06 


100 


70x70 


12.32 


64653 


780.88 


37.48 


160.04 


200 


30x30 


3.07 


42311 


126.37 


30.22 


83.87 


200 


50x50 


10.18 


67770 


374.12 


52.49 


173.29 


200 


70x70 


19.32 


165690 


933.70 


62.30 


267.12 


600 


30x30 


4.67 


134887 


173.62 


43.36 


152.36 


600 


50x50 


25.51 




728.01 


125.85 


493.39 


600 


70x70 


35.94 




1584.98 


166.30 


686.73 


1000 


30x30 


51.61 


275220 


209.46 


53.16 


195.20 


1000 


50x50 


36.43 




1179.99 


214.05 


848.18 


1000 


70x70 


36.03 




1807.08 


221.78 


895.51 



Table 2: Speed of the algorithms for continuous data with two-dimensional graph in 
seconds. 



Param. 


Other 


FusedLasso 


n 


P 


Glmnet 


exact 


naive 


Huber 


100 


10000 


31.25 


753.5 


65.87 


291.633 


100 


20000 


60.62 


2509.4 


115.89 


515.66 


100 


100x100 


24.21 


3161.8 


50.6 


289.1 


100 


150x150 


46.55 


22164.49 


86.6 


502.1 



Table 3: Speed of the algorithms for continuous data with one- and two-dimensional 
graph for large number of variables in seconds. 
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Param. 


II • 1 


i/p 


RMSE 


II ' 1 1 oo 


n 


P 


naive 


Huber 


naive 


Huber 


naive 


Huber 


100 


1000 


0.088 


0.0077 


0.19 


0.016 


1.36 


0.16 


100 


3000 


0.033 


0.0040 


0.11 


0.0096 


1.39 


0.11 


100 


5000 


0.018 


0.0043 


0.079 


0.011 


1.16 


0.15 


200 


1000 


0.077 


0.0023 


0.13 


0.0044 


0.80 


0.044 


200 


3000 


0.030 


0.0025 


0.083 


0.0052 


0.84 


0.077 


200 


5000 


0.0188 


0.0028 


0.057 


0.0060 


0.90 


0.080 


600 


1000 


0.16 


0.030 


0.27 


0.034 


1.17 


0.25 


600 


3000 


0.023 


0.0010 


0.045 


0.0022 


0.38 


0.026 


600 


5000 


0.013 


0.0014 


0.032 


0.0031 


0.46 


0.034 


1000 


1000 


0.16 


0.024 


0.27 


0.032 


1.05 


0.30 


1000 


3000 


0.017 


0.0017 


0.046 


0.0033 


0.50 


0.077 


1000 


5000 


0.008 


0.0012 


0.023 


0.0024 


0.37 


0.026 



Table 4: Accuracy of the algorithms for continuous data with one-dimensional graph. 
Accuracy is measured as 1 1 • \ \i/p, root mean-squared error (RMSE) and | | • | |oo, where for 
each of these measures, the worst result over all Ai, A2 combinations is being reported 
and averaged over the 10 simulations. 



and f3 (Ai, A2) be the solution for penalty parameters (Ai, A2) for the maximum- flow 
algorithm (exact solution), Huber penalty approach and naive algorithm respectively, 
then we define the reported accuracy as 



max err 

Ai,A2 



exact 

/3 (Ai,A 2 ) 



"Huber , , , , 

(3 (Ai,A 2 ; 



and 



max err 

Ai,A2 



,exact 



(3 (Ai,A 2 )-/3 (Ai,A 2 ) 



for the error functions 



1 P 

err(x) = — \ 

y i=l 



\x\\ x /p 



err(x) 
err(x) 



1 p 

- x f = RMSE 



max | x — x 00 • 

i=l,...,p 



The accuracies for the one-dimensional and two-dimensional simulations can be 
seen in Tables [4] and |5j As we can see there, the accuracy of the Huber penalty based 
approach is in general very good, showing small errors for || • ||i/p and the RMSE. 
Looking at || • ||oo, we can see that even for every single component, errors are usually 
below 0.1 (the non-zero elements in the true f3- vector are set to 1). Overall, the Huber 
penalty based approximation gives a good performance, while the error for the naive 
approach is usually an order of magnitude larger and in some cases very high (see 
II ■ I loo-measure). 
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Par am. 


II ' 


[i/p 


RMSE 


II ' 1 1 oo 


n 


P 


naive 


Huber 


naive 


Huber 


naive 


Huber 


100 


30x30 


0.052 


0.016 


0.13 


0.045 


0.83 


0.35 


100 


50x50 


0.015 


0.0051 


0.066 


0.024 


0.85 


0.32 


100 


70x70 


0.008 


0.0021 


0.049 


0.013 


0.89 


0.25 


200 


30x30 


0.048 


0.0085 


0.095 


0.018 


0.59 


0.11 


200 


50x50 


0.019 


0.0035 


0.062 


0.012 


0.64 


0.13 


200 


70x70 


0.009 


0.0017 


0.042 


0.0083 


0.54 


0.13 


600 


30x30 


0.18 


0.0062 


0.28 


0.012 


0.94 


0.091 


600 


50x50 


0.026 


0.0017 


0.048 


0.0045 


0.31 


0.048 


600 


70x70 


0.013 


0.0013 


0.034 


0.0039 


0.32 


0.051 


1000 


30x30 


0.17 


0.0074 


0.26 


0.011 


0.84 


0.084 


1000 


50x50 


0.023 


0.0012 


0.037 


0.0024 


0.26 


0.035 


1000 


70x70 


0.015 


0.0011 


0.029 


0.0026 


0.23 


0.036 



Table 5: Accuracy of the algorithms for continuous data with two-dimensional graph. 
Accuracy is measured as 1 1 • \ \i/p, root mean-squared error (RMSE) and | | • | |oo, where for 
each of these measures, the worst result over all Ai, A2 combinations is being reported 
and averaged over the 10 simulations. 

4 Discussion 

In this article we have presented two novel algorithms based on coordinate-wise opti- 
mization that solve the Fused Lasso, both of which are considerably faster than currently 
available methods. For the maximum-flow based approach we have proven that it is 
guaranteed to converge to the global optimum, however, for problems with large num- 
bers of variables this can be slow due to the complexity of maximum-flow algorithms. In 
order to remedy this problem, we also introduced a Huber penalty based procedure that 
does not rely on maximum-flow problems and shows better performance for situations 
with p 3> n. 

Apart from this we also extended the Fused Lasso to allow for more general penalty 
structures by using arbitrary undirected graphs and weights as well as more response 
types by implementing logistic regression as well as the Cox proportional hazards model. 
An implementation of this algorithm will be provided in the R package FusedLasso that 
will be available on CRAN. 



A A brief introduction to subgradients 



In convex optimization, often some of the functions in a problem are not different iable 
everywhere. In this case, instead of a regula r gradient , we ca n use a subgradient. The 
following introduction is mostly taken from iBertsekad (119991 ) . As a starting point, we 
first want to define what a subgradient is and then describe a condition that guarantees 
that a convex function has a minimum at a point x. Afterwards, we will derive the 
relevant subgradient expressions that are being used in this article. 



Definition 2. Given a convex function f : R" — > R, we say a vector d G R n 



is a 
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subgradient of f at a point x G R n if 

f(z)> f(x) + (z-x)'d, VzGR". 

// instead f is a concave function, we say that d is a subgradient of f if —d is a 
subgradient of — f at x. The set of all subgradients of a convex function f at x G R n is 
called the sub- differential of f at x, and is denoted by df(x). 

For subgradients, some basic properties simila r to regular grad ients hold and the 
proof to the following Proposition can be found in iBertsekasI (119991 ) on pp. 712-716. 

Proposition 1. Let f : R™ — > R be convex. For every x G R n , the following hold: 

1. If f is equal to the sum fi + - ■ ■+f m of convex functions fj : R n — > R, j = 1, . . . , m, 
then df(x) is equal to the vector sum df\(x) + • • ■ + df m (x). 

2. If f is equal to the composition of a convex function h : R m — > R and an m x n 
matrix A, that is [f(x) = h(Ax)], then df(x) is equal to A'dh(Ax) = {A'g : g G 
dh(Ax)}. 

3. x minimizes f over a convex set A C R n if and only if there exists a subgradient 
d G df(x) such that 

d\z - x) > 0. \/z G A. 

In our case here for the convex functions we optimize over the set A = R p and 
therefore the last statement says that x minimizes a function / over W if and only if 
a subgradient d G df(x) exists such that 

d = 0. 

Now, by using the proposition from above, all we need to calculate the subgradient 
of the loss function is the subgradient of f(x) = \x\ which is df(0) = [—1, 1] and 
df(x) = sign(x) for i^O. Therefore, the subgradient of the loss function 



1 P 

-(y - X/3) T (y - X/3) + X 1 ^ w k \(3 k \ + A 2 ^ w kl \f3 k - A 

w.r.t. f3 k is 



2 

k=l {k,l)eE,k<l 



df(/3k) ^ . Wk ~ Pi) 
A2 2^ Wkl m 



k=l ' k (k,l)€E,k<l 



and using the optimality condition from above, a solution (3 is optimal if there exists 
s k , t k i such that 

-x£(y - X(3) k + Ais fc + A 2 ^ t kl for k = l,...,p 

(k,l)£E,k<l 

where s k = sign(f3 k ) for ft^O and s k G [—1, 1] otherwise. Similarly t k i = sign(/3/ c — f3i) 
for f3 k = (3i and t k \ G [—1, 1] otherwise. For notational convenience, we set t\ k = —t k i 
for all k < I. 
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B Proof of Theorem U 

Before we start with the main proof, we show the following lemma, where we guarantee 
that the step size of each coordinate-move in the naive coordinate-wise algorithm has 
to converge to 0. 

Lemma 1. Assume that in the coordinate-wise algorithm we are starting at point (3 
and are optimizing coordinate k$. Let be the new estimate after the optimization of 
coordinate fco, i-e. (3 k = Pk for all k ^ fco- Then there exists a constant a such that 

90) < g{P) - a(P ko - h f. 

The value of a is independent of 'fco and the starting point (3. Here, we have suppressed 
the dependence of g on other variables in the model for notational convenience. 

Proof. If we just consider function g((3) as a function of (3k , then it has the form 

K 



q(/3k ) = akoWko - b) 2 + ^2di\/3 ko - a 



i=0 

where a ko = (X T X) koko , di > and b, q G R. The function q is differentiable everywhere 
except at point q and has the derivative 

q'(/3 ko ) = 2a ko (/3 ko - b) - ^ di 

i\Pk Q <Ci i\Pk >Ci 

which is clearly a piecewise linear function with slope 2a ko and jumps for (3 ko = q with 
height 2di > 0. For the minimum f3 ko we therefore know that 

lim q'(P ko )<0< lim q'(f3 k() ) 

and as all jumps are positive it holds 

q'Wk ) > 2a ko (/3 ko - /3 ko ) for /3 ko > /3 ko 

and 

q'(Pko) < 2a ko (/3 ko - /3 ko ) for ko < /3 ko . 
However, from this it immediately follows that 

q(Pko) > q(Pko) + ak ((3k - (3k ) 2 - 

This claim still holds if instead of a ko we use a = m.m k a k . As g(f3 ko ) = q((3 ko ) by 
construction, the claim follows. □ 

We now go on to prove Theorem [TJ The proof closely follows parts of the proof of 



Tsengl ( 1200 ll ; Theorem 4.1). 
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Proof. Assume that our algorithm starts at fx* ' and let (3 r be the r-th step of the 
component- wise algorithm. Note that B = {(3\g((3) < g(/3 {0) )} is a compact set and as 
each coordinate-wise step only decreases g, it follows that f3 r G B for all r. 
Now let r G 1Z be a converging subsequence with 

lim f3 r = 8. 

For each j G {1, . . . ,p}, consider the subsequence (3 r+:) for r G 1Z. As g is continuous 
and g{f3 r ) is monotonically decreasing, we get 

lim g((3 r ) = g(S) 

r— >oo 

exist. This, together with Lemma [1] then also implies that f3 r+ ^ — /3 r+J+1 —¥ and thus 

lim (3 r+j = d 

so that the limit of the sequence that is shifted by j also exists and is equal to the limit 
of the unshifted sequence. 

Now we want to show that then, 8 is coordinate-wise minimum w.r.t. g. Let 
e k be the vector that is 1 at position k and otherwise. Furthermore let k(r) = 
(r — 1 mod p) + 1 be the coordinate index that is being optimized in step r. Then we 
know that 

g((3 r+j + Xe k{r+j) ) > g(/3 r+j ) VA, Vj G {1, . . . ,p}. 

By using the subsequence that always moves coordinate k and going to the limit we 
then have 

g(8 + Xe k ) >g(S) 

which holds for every k G {1, . . . ,p}. Therefore, d is guaranteed to be a coordinate- wise 
optimum. 

Our proof as it is shown here works for an algorithm that always moves every 
coordinate and does not leave any out. However, in Algorithm [H we included an active 
set A. It can easily be seen that this is not a problem. Our proof works as it is for 
all coefficients that are included in A. Furthermore, A always adds coefficients, for 
which it is not optimal to remain at but never excludes any. Therefore, after a finite 
number of steps in the outer iteration, all variables that are non-zero at their optimum 
are included. This concludes our theorem. □ 



C Proof of Theorem [2 

Before we go to the main proof, we have to show a few lemmata: 

Lemma 2. Let (3 have associated partition ^3 and transformed problem g, for which 
(3 is coordinate-wise optimal. Also, let $ be the corresponding fused sets. Assume that 
Pk — for all k G P, for some i and Pi = Fi, i.e., the set Pi cannot be split (according to 
the "Split inactive set" rule). Then there exist s k G [—1, 1] and t k i G [—1, 1] for k,l G Pi 
for which 

dh/d(3 k + Aiw fc s fc + A 2 w M t M = VA; G Pi (3) 
where E t = {(k, I) G E : k,l G Pi}. 
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Proof. In order to show this, assume that such Sk and t k i do not exist. Now look at the 
Fused Lasso problem 

y~] 7TT- — {dh/d(3 k + AiWfcSfc) 2 + A 2 u>fci|sk-sj|. 

fcGPi K (k,l)€Ei,k<l 

which has its optimum at, say, s° w.r.t. s. It has the subgradients w.r.t. s k 

dh/d(3 k + XiWkS° k + \ 2 ^2 w ki t ki = Vk e Pi 

leEt 

and by optimality we know that the tki G [—1, 1] with tki = sign(sfc — si) for Sk ^ sj. 
By our assumption it follows that there exists a k G Pi with s° ^ [—1, 1] and w.l.o.g. 
we assume that 3s° k > 1. Let 

s = {*K>i} 

Then we know that t k i = 1 for all k G 5" and / ^ 5. Therefore, we have 



= dh/df3k + \iw k s° k + Wkitki ] > 

kes \ i-.iePi,(k,i)eE 

Y dh/df3 k + Aiw fc + Y ^2W k it H 
kes \ idePi\S,(k,i)eE 

where in the inequality we use s° > 1 for k G S as well as t k i = —Uk- Then 

- Y 9h/d/3 k - Ai^fe > Y Y A 2 u>fci 
kes kes i-.iePi\S,(k,i)eE 

where we used that t k i = 1 for all k G S, I £ S. In this equation, on the left hand 
side we have the sum over all capacities coming out of the source into S in Qf^_. On 
the right hand side, we have the sum over all edges from S into Pi\S, which by the 
max-flow-min-cut theorem in graph theory is the maximal flow possible in the graph 
coming out of nodes S. This implies, that the source is connected to at least one node 
in S in Qf +) implying that F; ^ Pi. However this is a contradiction and therefore the 
claim holds. □ 

We also need the following lemma that characterizes (3 where the associated partition 
^3 can be split and those where it cannot. 



Lemma 3. Assume that (3^ has partition and associated transformed (3^ and tar- 
get function g. Also assume that (3^ is coordinate-wise optimal w.r.t g. Furthermore, 
let $ be the fused sets associated with and (3^ m \ Then, the following statements hold 

(a) IfS^V, then /3 (m+1) ± /3 (m) . 

(b) (3 {m) is the global optimum iff<# = $. 
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Proof. Let us first deal with case (jaj). The condition implies, that at least one of the sets 
of *p has been split. Let Pj be a set that has been split and Fi + ^ (otherwise, switch 
signs). Then the condition for splitting a set implies that the coordinate-wise algorithm 
will move at least one of the coordinates. To see this, let fa for k, I G Pj U {r, s} be 
the solution of the with Pi and graph QM associated maximum-flow problem. Then by 
definition of Fi + we know that 

kGF t+ yk k&F l+ 

fsk = ^2 ° ki 

k£F 1+ (k,l):keF i+ ,l£Pi\F i+ ,(k,l)£EM 

where the last inequ ality follows from the max-flow-min-cut theorem of graph theory 
(ICormen et al.l 1200 ll ; Chapter 26). If Fi + is not one connected group, but consists 
of several disconnected components, this is true for every component. Therefore, we 
can just assume that F i+ consists of just one component. By the definition of our 
constrained problem we know that 

<7(X, y, 3 M , w, g, A) = <7(X, y, w, Q, A) 

and 

d+g(X, y, /3, w, G, A) r—d + g(X, y, /3, w, Q, A) 

where we take the directional derivative along the vector v with v k = 1/ a/T^T f° r k £ Fi 
and v k = for k ^ F{. As h by the definition of Pi is differentiable at (3^ w.r.t. v, we 

get 

^ d + c7(X,y,/3( m >,w,g,A) _ 

« ~~ 

ov 

\- ^(X,y,/3 M ,w,g,A) ^ 

fe6F i+ Pfc k&F i+ ,l£Pi\F i+ ;(k,l)eE 

^2 C sk + ^2 /« < 

fcSF l+ keF i+ ,l£Pi\F i+ ;(k,l)eE 

as deduced from the inequality above. Therefore, after applying the coordinate- wise 
optimization, we have at least moved the component of j3 corresponding to Fi + and get 

Now we show (jb]). First, we assume that /3 (m - ) is indeed optimal. From the fact that 
the coordinate-wise algorithm always improves (3^ w.r.t. g, part (jgj) proves that 

5 ^ <P /3 (m) not optimal. 

Negating this just proves the first part of the claim. 

For the opposite direction, assume that J = ^5. In order to show optimality of a 
non-differentiable function, we can use subgradients again. For our case, it says that a 
solution (3 is optimal, if there are Sk for k G 1, . . . ,p and tki with (k, I) G E such that 

- (X T (y - Xf3)) k + X lWk Sk + A 2 J2 w kihi = (4) 

l:(k,l)eE 



22 



for all k. Here, it has to hold that Sk = sign(ft) if ft 7^ and G [—1, 1] for ft = 0. 
Similarly, t M = sign(ft - (3i) for ft 7^ 0, and t M G [-1, 1] for ft = ft. 

As specified before in Section 12. 3[ the sums over the tki can be divided into the tki 
within sets Pj and those between. Then Equation (jlj) is 

dh/d/3 k + \iw k s k + A 2 ^ w k it k i = 0. (5) 

ZSPi:(A:,Z)G£; 

First consider the case where ft ^ for k G Pj. Here we have then that Sk = 
sign(ft). Then let fki for k, I G P« be the solution of graph C?* 1 . From the assumption 
that P{ = Fi, we know that in the residual graph Qf 1 , no node is connected to the source 
or the sink, all edges coming from the source or going into the sink are at maximum 
capacity. Together, this gives for all k G Pj either 

C-sk f sk ^ ^ fkl 

l:l£Pi,(k,l)£E 

or 

Ckr = fkr= £ flk 
l:lePi,(l,k)£E 

depending on if k is connected to the source s or the sink r in Qf 1 and by definition of 
c s k and Ckr then 

= ^ + A lWfe sign(4 m) ) + /« 

for all k E Pi (noting that /jy = —fik). By definition of cm we know that —\2Wki < 
/a/ < A2^fc; and therefore 

-1 < tfcj := < 1 

A2WM 

for all k, I G P«; (k,l) G P. Therefore, with this definition of tki, Equation §5§ holds, 
which implies that the subgradient equations hold. 

Now it remains to show the same for ft = for k G Pi for some i. However, we have 
already shown in Lemma [2] that in this case Equation ([5]) holds, which in turn implies 
that the subgradient equations hold. 

Putting all these cases together, we have shown that the subgradient equations hold 
for all k G Pj and for all 1 < i < and therefore, the optimality of (3 is shown. □ 

Now that we have established conditions for optimality of the results, we only need 
two more, very brief lemmata. 

Lemma 4. Assume that we only allow fusion of sets, but not splits. Then the algorithm 
converges after a finite number of steps. 

Proof. Assume that we are currently in step m with estimate (3^ and associated 
partition ^ rn ^ = ^ {m) (as we only allow fusions, not splits, the equality has to hold). 
Then after the application of the coordinate-wise algorithm we get the next estimate 
£j( m+1 ) w ith partition ^ m+1 \ If q3( m + 1 ) = then we cannot move any further 

without splits after a total of m+ 1 steps. However if ^}( m+1 ) ^ <p( m ) ; the fact that we 
can only fuse coefficients implies that | < p( m+1 )| < |^( m )|. Therefore, at most | < p( m+1 )| 
further steps are possible. Thus, the algorithm converges after a finite number of 
steps. □ 
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Lemma 5. Let (3 have partition with associated transformed coefficients (3 and trans- 
formed problem g. Then there are only finitely many points (3 such that f3 is coordinate- 
wise optimal w.r.t. g. 

Proof. First of all, we note that by the definition of it holds that g is different iable 
w.r.t. ft for all ft ^ and as they are coordinate- wise optimal, <9g/<9ft = 0. Let 

N = {i : ft = 0}. 

As the function g is strictly convex in the variables N c = {1, . . . , |^3|}\A r at point f3 
and thus, fixing all variables in iV at 0, (3 is the global solution for iV c and is unique. 
Therefore, for any fixed partition and set N , there is at most one (3 that is coordinate- 
wise optimal. As there are only finitely many *}3 and N, the claim holds. □ 

After all these lemmata, we now show Theorem [2j 

Proof. First, we note that Lemma H] guarantees that in a finite number of steps, we will 
be able to get to a coordinate-wise optimal point that cannot be moved anymore by 
only fusing sets. Furthermore, Lemma [5] guarantees that there are only finitely many 
such points. In Lemma [U part (jaj) shows that if we are at such a point and we can split 
a set here, then we will move away (and never return as the coordinate-wise procedure 
always improves the loss function). On the other hand, part (jbj) shows that if we cannot 
split a set, then we are at the global optimum. 

Overall, we see at worst, in a finite number of steps we can visit each such coordinate- 
wise optimal and stable point, but will only visit each one at most once and only stop at 
the global optimum, which is trivially one of these coordinate-wise optimal and stable 
points. 

Therefore, the algorithm converges to the global optimum in a finite number of 
steps. □ 
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